I had wanted to gain some knowledge of data analysis for quite a long time. I found this course simply by browsing the list of courses with an open registration date. It is an amazing opportunity to learn the basics of up-to-date data analysis from experts.
The textbook is very well organized and easy to learn with. However, it is not always easy to digest information about tools you haven't used yet.
https://github.com/SergeiRaik/IODS-project
# This is a so-called "R chunk" where you can write R code.
date()
## [1] "Mon Nov 28 23:33:47 2022"
This week I studied regression models. They are extremely important in data analysis because they serve several purposes. First, they help us explain the relationship between variables and form hypotheses about how certain parameters affect a target variable. Another purpose is prediction: using a regression model we can predict unknown values of the target variable from a set of explanatory variables. Here we used a dataset derived from a 2014 questionnaire aimed at identifying how various learning strategies (deep, surface and strategic) affect the learning outcome (exam points).
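As a minimal illustration of these two purposes (explanation and prediction), here is a sketch using R's built-in cars dataset rather than the course data:

```r
# Sketch on R's built-in 'cars' dataset (not the course data):
# explain stopping distance by speed, then predict for an unseen value.
fit <- lm(dist ~ speed, data = cars)
coef(fit)                                       # intercept and slope
predict(fit, newdata = data.frame(speed = 21))  # predicted distance at speed 21
```

The same pattern (fit with lm(), inspect coefficients, predict on new data) is used throughout this chapter.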
# import necessary packages
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(tidyverse)
## ── Attaching packages
## ───────────────────────────────────────
## tidyverse 1.3.2 ──
## ✔ tibble 3.1.8 ✔ purrr 0.3.5
## ✔ tidyr 1.2.1 ✔ stringr 1.4.1
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
#import dataset from file and set proper column types
learning2014 <- read_csv("data/learning2014.csv", col_types = cols(.default = "n", age = "i", gender = "f", points = "i"))
str(learning2014) # structure of the dataset
## spc_tbl_ [166 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ age : int [1:166] 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num [1:166] 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num [1:166] 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num [1:166] 3.37 2.75 3.62 3.12 3.62 ...
## $ surf : num [1:166] 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int [1:166] 25 12 24 10 22 21 21 31 24 26 ...
## - attr(*, "spec")=
## .. cols(
## .. .default = col_number(),
## .. gender = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE),
## .. age = col_integer(),
## .. attitude = col_number(),
## .. deep = col_number(),
## .. stra = col_number(),
## .. surf = col_number(),
## .. points = col_integer()
## .. )
## - attr(*, "problems")=<externalptr>
dim(learning2014) # dimensions of the dataframe
## [1] 166 7
The dataset consists of 7 variables and 166 observations. The variable gender, a two-level factor, encodes the respondent's gender. The integers age and points are the respondent's age in years and exam points, respectively. The numerical variables attitude, deep, stra, and surf are derived from respondents' answers and scaled to a 1–5 scale. More information on the data can be found here. Briefly, they represent three learning strategies (deep, surface and strategic) and the general attitude towards statistics.
First, let's take a look at the summary of each variable.
summary(learning2014) # dataframe summary
## gender age attitude deep stra
## F:110 Min. :17.00 Min. :1.400 Min. :1.583 Min. :1.250
## M: 56 1st Qu.:21.00 1st Qu.:2.600 1st Qu.:3.333 1st Qu.:2.625
## Median :22.00 Median :3.200 Median :3.667 Median :3.188
## Mean :25.51 Mean :3.143 Mean :3.680 Mean :3.121
## 3rd Qu.:27.00 3rd Qu.:3.700 3rd Qu.:4.083 3rd Qu.:3.625
## Max. :55.00 Max. :5.000 Max. :4.917 Max. :5.000
## surf points
## Min. :1.583 Min. : 7.00
## 1st Qu.:2.417 1st Qu.:19.00
## Median :2.833 Median :23.00
## Mean :2.787 Mean :22.72
## 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :4.333 Max. :33.00
It can be seen that the variables related to learning strategies and attitude share the same 1–5 range. The number of female participants is about twice that of male participants.
Let's plot the variables against each other to see whether any correlation can be observed visually.
pairs(learning2014[-1]) # plot every variable against each other excluding 1st column (gender)
Due to the high variation it is hard to observe any correlation in these plots. We can instead compare the data using the ggpairs() function from the GGally library, which also provides variable distribution plots and correlation coefficients.
# create a more advanced plot matrix with ggpairs()
ggpairs(learning2014, mapping = aes(alpha = 0.3, col = gender), lower = list(combo = wrap("facethist", bins = 20)))
From the shapes of the distribution graphs we can conclude that all variables except age have a more or less symmetrical bell shape. To check this, let's make a QQ plot for each variable. Simply put, a QQ plot compares a sample distribution with the theoretical normal distribution: the graph should be approximately linear if the sample is normally distributed.
par(mfrow = c(2,3)) # create a plot matrix
# QQ-plot of each variable against the theoretical normal distribution
qqnorm(learning2014$age, main = "age"); qqline(learning2014$age)
qqnorm(learning2014$attitude, main = "attitude"); qqline(learning2014$attitude)
qqnorm(learning2014$deep, main = "deep"); qqline(learning2014$deep)
qqnorm(learning2014$stra, main = "stra"); qqline(learning2014$stra)
qqnorm(learning2014$surf, main = "surf"); qqline(learning2014$surf)
qqnorm(learning2014$points, main = "points"); qqline(learning2014$points)
Indeed, we can see that age is not normally distributed. The other plots are fairly linear.
Let's create a linear regression model where exam points are the target variable and attitude, stra and surf are the explanatory variables, chosen based on their correlation coefficients in the previous section.
# create a multivariable regression model
my_model <- lm(points ~ attitude + stra + surf, data = learning2014)
# create a model summary
summary(my_model)
##
## Call:
## lm(formula = points ~ attitude + stra + surf, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.0171 3.6837 2.991 0.00322 **
## attitude 3.3952 0.5741 5.913 1.93e-08 ***
## stra 0.8531 0.5416 1.575 0.11716
## surf -0.5861 0.8014 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
From the p-values we can conclude that attitude has the strongest influence on the target variable, while stra and surf are statistically insignificant. We can therefore simplify the model to a single explanatory variable without losing much predictive power.
# create a simplified model with single variable
my_model2 <- lm(points ~ attitude, data = learning2014)
# create a model summary
summary(my_model2)
##
## Call:
## lm(formula = points ~ attitude, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
Indeed, the multiple R-squared decreased only slightly, from about 0.21 to 0.19. However, neither model is very precise, as each explains only ~20 % of the variability in exam points. From the regression model we can conclude that the general attitude towards statistics correlates positively with the exam outcome, although the effect is not very large.
Let's have a look at the plot of exam points vs attitude.
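To see what the simplified model actually predicts, we can plug the printed coefficients in by hand (the numbers below are copied from the summary above):

```r
# Sketch: points ≈ intercept + slope * attitude, using the fitted
# coefficients from the model summary above.
b0 <- 11.6372
b1 <- 3.5255
b0 + b1 * 4  # predicted exam points for attitude = 4, about 25.7
```

In practice the same value would come from predict(my_model2, newdata = data.frame(attitude = 4)).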
qplot(attitude, points, data = learning2014) + geom_smooth(method = "lm") # points vs attitude linear plot
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
## `geom_smooth()` using formula = 'y ~ x'
Our linear model suggests a linear relationship between exam points and the general attitude towards statistics. To check the model assumptions, we can make diagnostic plots.
plot(my_model2, which = 1) # residuals vs fitted
The residuals vs. fitted plot shows the difference between the observed and fitted values: if an observed value equals the predicted value exactly, the residual is zero. From our plot we can conclude that the linearity assumption is reasonable, as the residuals are randomly distributed around the zero line. We can also spot observations that do not fit the model: in observations 35, 56 and 145 the students got unexpectedly low exam points despite their high attitude scores.
plot(my_model2, which = 2) # QQ-plot
To check whether the residuals are normally distributed, we made a QQ plot where the quantiles of the residual distribution are plotted against the theoretical normal quantiles. In our case the graph is fairly linear, although there is some deviation at the extremes.
plot(my_model2, which = 5) # residuals vs leverage plot
The residuals vs. leverage plot shows that none of the points fall outside Cook's distance, and therefore none of them can be considered influential observations.
This week we were working on logistic regression, a method for modeling the relationship between a binary target variable and a set of explanatory variables; it can be viewed as an extension of linear regression. The dataset we have been working with describes student achievement in secondary education at two Portuguese schools. It includes information on family status, school performance, spare time, alcohol consumption etc. The complete description is published in the UCI Machine Learning Repository.
Let's import the wrangled dataset and take a look at its structure.
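Under the hood, logistic regression models the log-odds of the target as a linear function of the explanatory variables; the inverse logit maps the log-odds back to a probability. A minimal sketch:

```r
# Sketch: the inverse logit (logistic) function turns log-odds into probabilities
inv_logit <- function(x) 1 / (1 + exp(-x))
inv_logit(0)  # log-odds of 0 correspond to a probability of 0.5
inv_logit(2)  # about 0.88
```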
# import necessary packages
library(ggplot2)
library(dplyr)
library(GGally)
library(tidyverse)
#import dataset from file and set proper column types
alc <- read_csv("data/alc.csv", show_col_types = FALSE, col_types = cols(.default = "f", age = "i", Medu = "i", Fedu = "i", traveltime = "i", studytime = "i", famrel = "i", freetime = "i", goout = "i", Dalc = "i", Walc = "i", health = "i", failures = "n", absences = "n", G1 = "n", G2 = "n", G3 = "n", alc_use = "n", high_use = "l"))
#check dataset structure, dimensions and column types
glimpse(alc)
## Rows: 370
## Columns: 35
## $ school <fct> GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP,…
## $ sex <fct> F, F, F, F, F, M, M, F, M, M, F, F, M, M, M, F, F, F, M, M,…
## $ age <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15, 15, 15,…
## $ address <fct> U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U,…
## $ famsize <fct> GT3, GT3, LE3, GT3, GT3, LE3, LE3, GT3, LE3, GT3, GT3, GT3,…
## $ Pstatus <fct> A, T, T, T, T, T, T, A, A, T, T, T, T, T, A, T, T, T, T, T,…
## $ Medu <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, 3, 3, 4,…
## $ Fedu <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, 3, 2, 3,…
## $ Mjob <fct> at_home, at_home, at_home, health, other, services, other, …
## $ Fjob <fct> teacher, other, other, services, other, other, other, teach…
## $ reason <fct> course, course, other, home, home, reputation, home, home, …
## $ guardian <fct> mother, father, mother, mother, father, mother, mother, mot…
## $ traveltime <int> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, 3, 1, 1,…
## $ studytime <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, 2, 1, 1,…
## $ schoolsup <fct> yes, no, yes, no, no, no, no, yes, no, no, no, no, no, no, …
## $ famsup <fct> no, yes, no, yes, yes, yes, no, yes, yes, yes, yes, yes, ye…
## $ activities <fct> no, no, no, yes, no, yes, no, no, no, yes, no, yes, yes, no…
## $ nursery <fct> yes, no, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, …
## $ higher <fct> yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes,…
## $ internet <fct> no, yes, yes, yes, no, yes, yes, no, yes, yes, yes, yes, ye…
## $ romantic <fct> no, no, no, yes, no, no, no, no, no, no, no, no, no, no, ye…
## $ famrel <int> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, 5, 5, 3,…
## $ freetime <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, 3, 5, 1,…
## $ goout <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, 2, 5, 3,…
## $ Dalc <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1,…
## $ Walc <int> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, 1, 4, 3,…
## $ health <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, 4, 5, 5,…
## $ failures <dbl> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0,…
## $ paid <fct> no, no, yes, yes, yes, yes, no, no, yes, yes, yes, no, yes,…
## $ absences <dbl> 5, 3, 8, 1, 2, 8, 0, 4, 0, 0, 1, 2, 1, 1, 0, 5, 8, 3, 9, 5,…
## $ G1 <dbl> 2, 7, 10, 14, 8, 14, 12, 8, 16, 13, 12, 10, 13, 11, 14, 16,…
## $ G2 <dbl> 8, 8, 10, 14, 12, 14, 12, 9, 17, 14, 11, 12, 14, 11, 15, 16…
## $ G3 <dbl> 8, 8, 11, 14, 12, 14, 12, 10, 18, 14, 12, 12, 13, 12, 16, 1…
## $ alc_use <dbl> 1.0, 1.0, 2.5, 1.0, 1.5, 1.5, 1.0, 1.0, 1.0, 1.0, 1.5, 1.0,…
## $ high_use <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
The dataset consists of 35 variables and 370 observations.
Hypothesis 1: People involved in romantic relationships are less likely to be heavy drinkers.
alc %>% group_by(romantic,high_use) %>% summarise(count = n())
## `summarise()` has grouped output by 'romantic'. You can override using the
## `.groups` argument.
## # A tibble: 4 × 3
## # Groups: romantic [2]
## romantic high_use count
## <fct> <lgl> <int>
## 1 no FALSE 173
## 2 no TRUE 78
## 3 yes FALSE 86
## 4 yes TRUE 33
g <- ggplot(data = alc, aes(x=high_use))
g+facet_wrap("romantic")+geom_bar()
From both the tabular and the visual summary it is clear that the fraction of heavy drinkers is roughly the same in both groups (~31 % among singles vs ~28 % among people in relationships), so the data do not support the hypothesis.
Hypothesis 2: People who go out with friends more often tend to drink more.
# group by frequency of going out and alcohol consumption
alc %>% group_by(goout,high_use) %>% summarise(count = n())
## `summarise()` has grouped output by 'goout'. You can override using the
## `.groups` argument.
## # A tibble: 10 × 3
## # Groups: goout [5]
## goout high_use count
## <int> <lgl> <int>
## 1 1 FALSE 19
## 2 1 TRUE 3
## 3 2 FALSE 82
## 4 2 TRUE 15
## 5 3 FALSE 97
## 6 3 TRUE 23
## 7 4 FALSE 40
## 8 4 TRUE 38
## 9 5 FALSE 21
## 10 5 TRUE 32
g <- ggplot(data = alc, aes(x=high_use))
g+facet_wrap("goout")+geom_bar()
Indeed, it is clearly seen that those who go out with friends more often tend to drink more.
Hypothesis 3: People who spend more time studying don't have time for drinking.
alc %>% group_by(studytime,high_use) %>% summarise(count = n())
## `summarise()` has grouped output by 'studytime'. You can override using the
## `.groups` argument.
## # A tibble: 8 × 3
## # Groups: studytime [4]
## studytime high_use count
## <int> <lgl> <int>
## 1 1 FALSE 56
## 2 1 TRUE 42
## 3 2 FALSE 128
## 4 2 TRUE 57
## 5 3 FALSE 52
## 6 3 TRUE 8
## 7 4 FALSE 23
## 8 4 TRUE 4
A correlation can be observed here as well: among students who spend less than 2 hours a week studying, nearly half are "heavy drinkers", whereas among those who study more than 10 hours the share drops to ~15 %.
g <- ggplot(data = alc, aes(x=high_use))
g+facet_wrap("studytime")+geom_bar()
Hypothesis 4: People who consume large amounts of alcohol have lower grades.
To analyze grades, we first take the average of the three grades and then compare the distributions between high and low alcohol consumers.
alc$grades <- rowMeans(alc[, c("G1","G2","G3")])
g <- ggplot(data = alc, aes(x=high_use, y = grades))
g + geom_boxplot()
From the boxplot it is not obvious whether alcohol consumption has an effect on grades. It seems that for low-drinkers the distribution is broader and the median is higher, whereas for high-drinkers the distribution of grades is tilted towards lower values.
alc %>% group_by(high_use) %>% summarise(mean(grades))
## # A tibble: 2 × 2
## high_use `mean(grades)`
## <lgl> <dbl>
## 1 FALSE 11.8
## 2 TRUE 10.9
The mean grades do not differ that much. Let's take a look at each distribution separately.
g <- ggplot(data = alc, aes(x=grades))
g + geom_density(alpha=0.2)+facet_grid(~high_use)
Here it seems that low-drinkers show a bimodal distribution, and the mode with higher grades is absent for high-drinkers. The regression model will show whether the mentioned parameters are actually significant.
To begin with, we create a model with high_use as the target variable and those mentioned in the hypotheses section (grades, studytime, goout, romantic) as explanatory variables.
# create a logistic model and its summary
m <- glm(high_use ~ grades + studytime + goout + romantic, data = alc, family = "binomial")
summary(m)
##
## Call:
## glm(formula = high_use ~ grades + studytime + goout + romantic,
## family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7852 -0.7805 -0.5395 0.9537 2.5739
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.43474 0.73913 -1.941 0.05224 .
## grades -0.05210 0.04664 -1.117 0.26400
## studytime -0.58343 0.16813 -3.470 0.00052 ***
## goout 0.72552 0.11848 6.123 9.16e-10 ***
## romanticyes -0.19076 0.27253 -0.700 0.48396
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 385.84 on 365 degrees of freedom
## AIC: 395.84
##
## Number of Fisher Scoring iterations: 4
# compute odds ratios (OR)
OR <- coef(m) %>% exp
# compute confidence intervals (CI)
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.2381778 0.05452509 0.9971120
## grades 0.9492380 0.86558442 1.0397804
## studytime 0.5579787 0.39679174 0.7685481
## goout 2.0658027 1.64754827 2.6242897
## romanticyes 0.8263314 0.48042098 1.4021977
An odds ratio close to 1 means that changing the explanatory variable does not change the odds of the target variable. Grades and romantic relationships, with ORs of 0.95 and 0.83 respectively, show little association with alcohol consumption. Meanwhile, each additional point of 'goout' roughly doubles the odds of high alcohol consumption, whereas each point of studytime roughly halves them.
Let's use our model to predict the level of alcohol consumption by calculating the probability of high_use with the predict() function.
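The odds ratios above are just the exponentials of the model coefficients, so they can be checked by hand from the summary:

```r
# Sketch: OR = exp(coefficient), using the estimates printed above
exp(0.72552)   # goout: about 2.07, each point roughly doubles the odds
exp(-0.58343)  # studytime: about 0.56, each point roughly halves the odds
```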
#calculate the probabilities of high_use using logistic model
probabilities <- predict(m, type = "response")
alc <- mutate(alc, probability = probabilities)
#prediction is true if probability > 0.5
alc <- mutate(alc, prediction = probability>0.5)
alc %>% group_by(high_use,prediction) %>% summarise(count = n())
## `summarise()` has grouped output by 'high_use'. You can override using the
## `.groups` argument.
## # A tibble: 4 × 3
## # Groups: high_use [2]
## high_use prediction count
## <lgl> <lgl> <int>
## 1 FALSE FALSE 238
## 2 FALSE TRUE 21
## 3 TRUE FALSE 70
## 4 TRUE TRUE 41
From the summarizing table we can conclude that when the model predicts a low level of alcohol consumption (FALSE), it is correct in 238 / (238 + 70) = 77 % of cases; when it predicts high, in 41 / (41 + 21) = 66 %. Overall, the model is wrong in (21 + 70) / (238 + 21 + 70 + 41) = 25 % of cases. Compared with a simple guessing strategy (predicting each class with 0.5 probability, which would be wrong about 50 % of the time), the model makes roughly 25 percentage points fewer wrong predictions, which is not bad considering the small number of variables used.
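The same error rate can be computed directly from the cross-tabulation counts (a sketch using the numbers from the table above):

```r
# Sketch: training error from the 2x2 confusion counts printed above
confusion <- matrix(c(238, 70, 21, 41), nrow = 2,
                    dimnames = list(correct = c("FALSE", "TRUE"),
                                    predicted = c("FALSE", "TRUE")))
wrong <- confusion["FALSE", "TRUE"] + confusion["TRUE", "FALSE"]
wrong / sum(confusion)  # about 0.25
```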
The dataset we are exploring today contains information about different suburban areas near Boston. It includes information about the environmental conditions of the towns, population, characteristics of housing and transportation, crime rate etc. The full description can be found here.
Let's load the necessary libraries and the Boston dataset.
# access the MASS package
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
library(tidyverse)
# load the data
data("Boston")
#explore structure and dimensions
glimpse(Boston)
## Rows: 506
## Columns: 14
## $ crim <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, 0.08829,…
## $ zn <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5, 12.5, 1…
## $ indus <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, 7.87, 7.…
## $ chas <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ nox <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524, 0.524,…
## $ rm <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172, 5.631,…
## $ age <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0, 85.9, 9…
## $ dis <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605, 5.9505…
## $ rad <int> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4, 4, 4,…
## $ tax <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311, 311, 31…
## $ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, 15.2, 15…
## $ black <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60, 396.90…
## $ lstat <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.93, 17.10…
## $ medv <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9, 15…
The dataset consists of 14 variables and 506 observations.
When analyzing a dataset it is always useful to look at the variable distributions. To do that, we first convert the data from wide format to long format (variable–value pairs) using the melt() function from the reshape library. Then we visualize the distributions using the ggplot2 library.
library(ggplot2)
library(reshape)
##
## Attaching package: 'reshape'
## The following objects are masked from 'package:tidyr':
##
## expand, smiths
## The following object is masked from 'package:dplyr':
##
## rename
melt.boston <- melt(Boston)
## Using as id variables
ggplot(data = melt.boston, aes(x = value)) +
stat_density() +
facet_wrap(~variable, scales = "free")
library(MASS)
library(tidyr)
library(corrplot)
## corrplot 0.92 loaded
# calculate the correlation matrix and round it
cor_matrix <- cor(Boston)
# print the correlation matrix
cor_matrix %>%
round(digits = 2)
## crim zn indus chas nox rm age dis rad tax ptratio
## crim 1.00 -0.20 0.41 -0.06 0.42 -0.22 0.35 -0.38 0.63 0.58 0.29
## zn -0.20 1.00 -0.53 -0.04 -0.52 0.31 -0.57 0.66 -0.31 -0.31 -0.39
## indus 0.41 -0.53 1.00 0.06 0.76 -0.39 0.64 -0.71 0.60 0.72 0.38
## chas -0.06 -0.04 0.06 1.00 0.09 0.09 0.09 -0.10 -0.01 -0.04 -0.12
## nox 0.42 -0.52 0.76 0.09 1.00 -0.30 0.73 -0.77 0.61 0.67 0.19
## rm -0.22 0.31 -0.39 0.09 -0.30 1.00 -0.24 0.21 -0.21 -0.29 -0.36
## age 0.35 -0.57 0.64 0.09 0.73 -0.24 1.00 -0.75 0.46 0.51 0.26
## dis -0.38 0.66 -0.71 -0.10 -0.77 0.21 -0.75 1.00 -0.49 -0.53 -0.23
## rad 0.63 -0.31 0.60 -0.01 0.61 -0.21 0.46 -0.49 1.00 0.91 0.46
## tax 0.58 -0.31 0.72 -0.04 0.67 -0.29 0.51 -0.53 0.91 1.00 0.46
## ptratio 0.29 -0.39 0.38 -0.12 0.19 -0.36 0.26 -0.23 0.46 0.46 1.00
## black -0.39 0.18 -0.36 0.05 -0.38 0.13 -0.27 0.29 -0.44 -0.44 -0.18
## lstat 0.46 -0.41 0.60 -0.05 0.59 -0.61 0.60 -0.50 0.49 0.54 0.37
## medv -0.39 0.36 -0.48 0.18 -0.43 0.70 -0.38 0.25 -0.38 -0.47 -0.51
## black lstat medv
## crim -0.39 0.46 -0.39
## zn 0.18 -0.41 0.36
## indus -0.36 0.60 -0.48
## chas 0.05 -0.05 0.18
## nox -0.38 0.59 -0.43
## rm 0.13 -0.61 0.70
## age -0.27 0.60 -0.38
## dis 0.29 -0.50 0.25
## rad -0.44 0.49 -0.38
## tax -0.44 0.54 -0.47
## ptratio -0.18 0.37 -0.51
## black 1.00 -0.37 0.33
## lstat -0.37 1.00 -0.74
## medv 0.33 -0.74 1.00
# visualize the correlation matrix
corrplot(cor_matrix, method="circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)
To scale the dataset we subtract each variable's mean and divide by its standard deviation (SD).
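On a toy vector, scale() is equivalent to this manual computation (a sketch):

```r
# Sketch: scale() subtracts the mean and divides by the SD, column by column
x <- c(2, 4, 6, 8)
manual <- (x - mean(x)) / sd(x)
all.equal(manual, as.vector(scale(x)))  # TRUE
```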
# center and standardize variables
boston_scaled <- scale(Boston)
# summaries of the scaled variables
summary(boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
# class of the boston_scaled object
class(boston_scaled)
## [1] "matrix"
# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)
As a result of scaling, each variable is now expressed in units of its standard deviation: the mean is zero, and an observation that deviates from the mean by one SD takes the value ±1.
To build a model for predicting the crime rate, we first have to turn it into a categorical variable. To do that we create a quantile vector with the quantile() function.
# create a quantile vector of crim and print it
bins <- quantile(boston_scaled$crim)
bins
## 0% 25% 50% 75% 100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
Then we break the crim variable into a four-level factor and substitute it for the original variable in the scaled dataset.
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, label = c("low", "med_low", "med_high", "high"))
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
Then we have to split the dataset into train and test samples.
# number of rows in the Boston dataset
n <- nrow(boston_scaled)
# choose randomly 80% of the rows
ind <- sample(n, size = n * 0.8)
# create train set
train <- boston_scaled[ind,]
# create test set
test <- boston_scaled[-ind,]
# save the correct classes from test data
correct_classes <- test$crime
# remove the crime variable from test data
test <- dplyr::select(test, -crime)
# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)
# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2400990 0.2549505 0.2549505 0.2500000
##
## Group means:
## zn indus chas nox rm age
## low 1.0992139 -0.9180158 -0.109974419 -0.8697067 0.5347381 -0.8816403
## med_low -0.1055088 -0.2798523 -0.004759149 -0.5385400 -0.1658617 -0.3028679
## med_high -0.3781741 0.2428217 0.262810770 0.3971722 0.0481647 0.4185395
## high -0.4872402 1.0171306 -0.155385496 1.0638394 -0.3635076 0.7921939
## dis rad tax ptratio black lstat
## low 0.7908651 -0.6846906 -0.7373948 -0.5408901 0.3739067 -0.79587223
## med_low 0.3524172 -0.5436697 -0.4405372 -0.0700480 0.3488275 -0.08236995
## med_high -0.4306980 -0.3953725 -0.2752081 -0.3140058 0.1392144 0.04280521
## high -0.8337660 1.6379981 1.5139626 0.7806252 -0.7684682 0.91251737
## medv
## low 0.6237035
## med_low -0.0358693
## med_high 0.1532996
## high -0.7646762
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.09706622 0.883334496 -0.94242333
## indus 0.02908800 -0.299005341 0.03801384
## chas -0.10357188 -0.132158731 0.02450388
## nox 0.41827024 -0.553512503 -1.25265429
## rm -0.07171519 0.017501315 -0.09048660
## age 0.25776905 -0.291202406 -0.08295926
## dis -0.05671087 -0.386126597 0.48550443
## rad 3.03325833 0.902362544 -0.10471743
## tax -0.08175475 -0.123037846 0.80500206
## ptratio 0.09860672 0.079471045 -0.24248560
## black -0.15891680 -0.008499813 0.15863175
## lstat 0.20384612 -0.201588997 0.37737472
## medv 0.13788742 -0.377841976 -0.26380794
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9369 0.0450 0.0181
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 2)
From the LDA plot we can clearly see that observations with a high crime rate are very well separated, while the other classes overlap, which will clearly reduce the predictive power of the model.
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 12 17 1 0
## med_low 4 16 3 0
## med_high 0 10 13 0
## high 0 0 0 26
Indeed, as you can see, the best results are obtained for the high class, whereas for the others there are many false predictions.
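Overall accuracy can be read off the cross-tabulation as the share of the diagonal (a sketch using the counts above; note that the exact numbers vary between runs, since the train/test split is random and no seed was set):

```r
# Sketch: overall accuracy = correct predictions / all predictions,
# using the cross-tabulation counts printed above (they vary between runs)
tab <- matrix(c(12, 4, 0, 0,  17, 16, 10, 0,  1, 3, 13, 0,  0, 0, 0, 26),
              nrow = 4,
              dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                              predicted = c("low", "med_low", "med_high", "high")))
sum(diag(tab)) / sum(tab)  # about 0.66
```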
# reload and standardize the dataset
data('Boston')
bst <- scale(Boston)
bst <- as.data.frame(bst)
# compute the Euclidean distance matrix between observations
dist_eu <- dist(bst)
# run k-means clustering with an initial guess of 3 clusters
km <- kmeans(x = bst, centers = 3)
Then we determine the optimal number of clusters by calculating the total within-cluster sum of squares (TWCSS) for a range of cluster counts.
set.seed(21)
#set the max clusters numbers
k_max <- 10
#calculate the total sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(bst, k)$tot.withinss})
library(ggplot2)
qplot(x = 1:k_max, y = twcss, geom = 'line')
The total WCSS drops sharply when moving to two clusters, which suggests that two is the optimal number of clusters.
Let's run the code again with two centers and plot the result:
km <-kmeans(bst, centers = 2)
pairs(bst, col = km$cluster)
# extract the predictor variables from the train set
model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404 13
dim(lda.fit$scaling)
## [1] 13 3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
First we set the color to be the crime classes of the train set.
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:reshape':
##
## rename
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = train$crime)
Second, we color the points by the k-means clusters (here with four centers, matching the number of crime classes).
km1 <- dplyr::select(train, -crime)
km1 <- kmeans(km1, centers = 4)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = km1$cluster)
The clusters are more or less similar; however, observations with a higher crime rate are better separated when coloring by crime class. (more chapters to be added similarly as we proceed with the course!)